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Abstract 

We develop a fully Bayesian hierarchical model for trend filtering, itself a new 
development in nonparametric, univariate regression. The framework more broadly 
applies to the generalized lasso, but focus is on Bayesian trend filtering. We compare 
two shrinkage priors, double exponential and generalized double Pareto. A simulation 
study, comparing Bayesian trend filtering to the original formulation and a number 
of other popular methods shows our method to improve estimation error while main¬ 
taining if not improving coverage probability. Two time series data sets demonstrate 
Bayesian trend filtering’s robustness to possible violations of its assumptions. 
Keywords: Bayesian analysis, trend filtering, locally adaptive regression splines, non¬ 
parametric 


1 Introduction 

Consider the nonparametric model of the function /q : [0,1] M, with zero mean, indepen¬ 
dent, Gaussian errors e, 

Vi = foixi) + ei, i = (1) 

Assume the observations i/i are generated via /o from the unique inputs xi < ■ ■ ■ < Xn- 
Mammen and van de Geer [1997] propose an estimator of /o, convergent at the minimax 
rate, by penalizing the total variation of the derivative of the function /, taken to be 

t i/<‘)(t 

i+l 

i=l 

with the set of knots {ti,... ,tp+i} equal to the inputs. Since the penalty TV(/^^^) is not 
easily computed, an alternative solution uses an estimate of the total variation penalty. 
Consider the estimator / = (/(xi),..., /(a;„))* of /o = {fo{xi ),..., fo{xn)y that solves 

/ = argmin|E-/||2 +ATV(/(*^)), 

/ 

where y is taken to be the vector of responses. R. J. Tibshirani [2014] proposed such an 
estimator that maintains the optimal (minimax) convergence rate. His method, called trend 
hltering, estimates TV(/*^^^) using the divided difference of order k of /. DeVore and Lorentz 
[1993] and de Boor [2005] provide two excellent references on the divided difference. 
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Trend filtering uses the generalized lasso framework [Tibshirani, 2014], Assume y G M"" 
and a model matrix X G The generalized lasso finds the minimum vector (3 & MP 

constrained by a linear transformation D G 

$ = aigminWy - X(3\\l +X\\D(3\\i, (2) 

P 

for some penalty parameter A > 0. Trend filtering uses the minimization problem in Equa¬ 
tion (2) with X ■= In and by cleverly choosing a penalty matrix D that when coupled with 
the norm recovers an estimate of the total variation penalty. 

We adapt trend filtering, as a special case of the generalized lasso, to the Bayesian set¬ 
ting. Similar efforts to fit lasso type, or other general shrinkage, problems with a hierarchical 
Bayesian model have been made [see Park and Casella, 2008, Hans, 2009, Kyung et ah, 2010, 
Griffin and Brown, 2011, Lee et ah, 2012]. We further investigate Bayesian trend filtering 
with another shrinkage prior, the generalized double Pareto, as developed by Lee et ah [2010] 
and Armagan et ah [2013]. The generalized double Pareto prior, when applied to Bayesian 
trend filtering, appears to estimate the true, underlying function with smaller error and to 
provide better frequentist coverage properties than does the more standard double exponen¬ 
tial prior. 

This paper reads as follows. Section 2 discusses trend hltering, its minimax convergence 
rate and standard errors. We present Bayesian trend filtering and details on its implemen¬ 
tation in Section 3. A simulation study comparing original trend hltering to the Bayesian 
version, and both of these to some popular univariate, nonparametric regression methods, 
is carried out in Section 4.1. Section 4.2 hts these methods to two real data sets. Sec¬ 
tion 5 briehy mentions Bayesian trend hltering’s relation to Gaussian process regression, 
summarizes our hndings, and mentions some future research directions. 

2 Trend Filtering 

Trend hltering, as discussed here, was developed in two stages. Kim et ah [2009] hrst in¬ 
troduced trend hltering for piecewise linear hts, providing a primal-dual interior point 
algorithm to ht the method at a specihed value of the penalty parameter, A. Tibshirani 
[2011] next identihed trend hltering as a special case of the generalized lasso, thus ohering a 
path algorithm that hts trend hltering over all values of the penalty parameter, A G [0, cxd). 
In a separate paper Tibshirani [2014] established convergence properties of the method. In 
the end, trend hltering hts a piecewise polynomial of order k with knots taken to be the set 
of inputs {xi}^. The piecewise polynomial comes from a continuous-time representation of 
the following discrete minimization problem, 

/ = argmm||!,-/||2 + A||D(“’‘+‘)/||,. (3) 

/ 

The discrete diherence matrix depends both on the order of the derivative of the 

function /, as chosen by the practitioner, and the inputs Xi, i = 1,... ,n. The penalty term 
I|^(a;,fc+i) 1 ^ estimates the total variation of the kth derivative of the function /. Because 
of this close relation to locally adaptive regression splines, trend hltering adapts to the 
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fluctuations of the underlying curve and thus achieves the same optimal convergence rate 
[Tibshirani, 2014], 

2.1 Minimax Convergence Rate 

Trend hltering converges at the minimax rate to the true underlying function of interest, 
as is shown by Tibshirani [2014], He used purely algebraic methods to show that trend 
hltering is sufficiently close to the locally adaptive regression spline estimator for the minimax 
convergence rate to carry over to trend hltering. It is well known that this convergence rate 
is not achieved by any estimator linear in y [Donoho and Johnstone, 1998, Tibshirani, 2014]. 
Thus, the optimal convergence rate depends on the norm. 

Another strategy to prove the minimax convergence rate for trend hltering would make 
use of the metric entropy of the underlying function space for which trend hltering hnds 
its solution. This strategy is analogous to that of Mammen and van de Geer [1997] for 
locally adaptive regression splines. Tibshirani [2014] briehy mentions the difficulty of such a 
proof. The requisite interpolating properties of trend hltering are quite difficult to establish 
because trend hltering uses piecewise polynomials with potentially discontinuous lower order 
derivatives. We mention this alternative strategy, not because we were able to over-come 
the aforementioned difficulties, but instead to highlight how Bayesian trend hltering could 
share metric entropy proofs of convergence rates across penalized regression and Gaussian 
process regression methods. This connection is discussed further in Section 5. 

2.2 Standard Errors 

As is summarized in Kyung et ah [2010], estimating the standard errors of lasso problems is 
quite difficult. This is also seen by the signihcant amount of attention paid to the problem; 
for example, see [Tibshirani, 1996, Knight and Fu, 2000, Osborne et ah, 2000, Fan and Li, 
2001, Potscher and Leeb, 2009]. However, with trend hltering the theory cited here does 
not hnd easy evidence. Figure 1 displays 2.5% and 97.5% quantiles of bootstrapped func¬ 
tion evaluations at each input, for a simple piecewise linear function (solid red). The 95% 
conhdence intervals (green dash) provide seemingly reasonable estimates of error. The only 
obvious problem seen in Figure 1 is the bias indicated by the histogram of the bootstrapped 
estimates of f 25 - 
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Figure 1: Left: A true piecewise linear function (solid red) with 95% bootstrap confidence 
intervals of a trend filtering fit, with Acy(io) chosen by 10-fold cross validation. Right: A 
histogram of the 500 bootstrapped values of / 25 , with the true value of /25 denoted by a red 
vertical line. 


Cross validation, the recommended method to estimate A, is both computationally costly 
and is well known to encourage over-fitting [Davison, 1997]. If you are willing to trade 
speed for an inexact solution, it is reasonably easy to overcome the computational cost of 
bootstrapping after one estimates the penalty parameter. For instance, we could estimate 
a trend filtering fit using the majorization-minimization techniques of Hunter and Li [2005]. 
An approximation to objective function (3), details of which are provided in Appendix A, 
quickly allows us to refit trend hltering given a new set of observations y* and a pre-calculated 
penalty parameter, say Acy(io)) estimated from 10-fold cross validation. Though this strategy 
is itself limited, it is quite fast. For a hxed value of A, it took a MacBook Pro 3.1 GHz 
Intel Core i7 just about half the time to calculate 500 bootstrapped approximations of / in 
Figure 1, with n = 50, as it did to fit the exact solution of trend filtering and estimate Acy(io)- 
In general, we found this approximation strategy to produce estimates relatively close to the 
exact fit. It was common to see mean absolute differences on the order of 10“^. However, 
cross validation for trend filtering frequently produces estimates of the true function that 
are too “wiggly.” Bootstrap methods that then rely on \cv will also produce poor results. 
The simulations and especially the real data sets presented in Section 4.1 highlight this 
point exactly. When the estimates of / under Xcv are too wiggly, bootstrapped confidence 
intervals fail to contain the true function across the sampled domain, thus lowering nominal 
coverage probability. 

The problem with choosing A, in our experience, largely disappears under the Bayesian 
approach. Bayesian trend filtering estimates A by incorporating it into the hierarchical 
model (4) so that the estimates of / are marginalized over all values of A. This encourages 
a more robust, stable estimate of /. The results in Section 4.1 justify our conclusions about 
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both trend filtering and Bayesian trend filtering. 


3 Bayesian Trend Filtering 


3.1 Formulation 

Similar to the work of Park and Casella [2008], Kyung et al. [2010], Griffin and Brown [2011], 
and Armagan et ah [2013], a fully Bayesian hierarchical model for Bayesian trend hltering 
can be expressed as a scale mixture of normals, a mixture that was first discovered by 
Andrews and Mallows [1974], The resulting hierarchical model puts a double exponential 
(dexp) 

|/|<T|ocexp ([-l||D(-.'-'+‘)/||d 

conditional prior on the penalty term f\\i. Park and Casella [2008] show that the 

double exponential prior conditional on a ensures that the joint posterior distribution of 
[/, CT^] is unimodal. The fully Bayesian hierarchical model to ht Bayesian trend hltering is 

y\f, -- Mnif, (T^In) 

... ,Un-k-i ~ A4(0,a^S/), 

E7> = = (E''"-'-+‘>)‘diagK',.. 

n-k-l ^2 ^4) 

CJI,... ,a;„_fc_i|A ~ JJ y exp(-A^a;j/2)cicUj, ujj > 0,\fj 
i=i 

X\a, p ^p)dX, A>0 

~ 7r{a‘^)da‘^, > 0. 

The ui,... ,Un-k-i are mutually independent, and throughout we let 7r(cr^) = cr“^, which 
is the limiting improper prior from an inverted gamma distribtion. The distribution on 
A, denoted by highlights the fact that a slight change of the prior will produce two 
different conditional prior distributions. The more common, double exponential conditional 
prior is found by letting -0 be a gamma distribution P(a,p) on A^ [Park and Casella, 2008, 
Kyung et ah, 2010]. Additionally, we consider a variation on the prior developed by Lee et al. 
[2010], Lee et al. [2012], and Armagan et al. [2013] which uses a P(a,p) prior on A instead 
of A^. The generalized double Pareto (gdp) distribution takes on the following form within 
Bayesian trend hltering. 


[fW] 


1 / 1 ||D(^’^+^)/||iA ^ 

2crp/« \ a ap/a J 


The gdp is known to eschew much of the bias otherwise induced by the exponential tails of the 
double exponential distribution [Lee et ah, 2010]. We explore the choice of hyperparameters 
a, p in Section 4.1. 

A simplie Gibbs sampler provides samples from the posterior distributions of interest for 
Bayesian trend hltering. The full conditionals shared by both conditional priors are 
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[/!•] ~ Mn {{In + S7')-'2/, <y\ln + , 


[iMi] ~ inverse-Gaussian 





r-1 


^{y- InfYiy - Inf) + 



Vj, 


The full conditional of for dexp is r(n — fc — 1 + a, YfJj=i ^ '^i/2 + p), and relative to gdP, 
[A|-] ~ r(n - P - 1 + a, \\D^^I+^^f\\i/a + p). 

Model (4) can be viewed as an extension of the work in Kyung et al. [2010]. Therefore, 
we appeal to their Propositions 4.1 and 4.2 which show that the underlying Gibbs sampler is 
geometrically ergodic. The two Gibbs samplers, for dexp and gdp, converge both in theory 
and in practice very quickly. 

Proposition 3.1. The Gibbs sampler for the hierarhcical model (4) is geometrically ergodic. 

We refer the reader to the proof by Kyung et al. [2010]. In Section 5, we show that Propo¬ 
sition 3.1 can help reduce the computational cost of Bayesian trend filtering. 

A few points contrasting trend filtering with Bayesian trend filtering should be noted. 
Trend hltering, by restricting the parameter space of the objective function (3), automatically 
sets some terms in the penalty to exactly zero. Such a data dependent selection of important 
predictors is philosophically appealing. Within trend filtering, this data dependent selection 
corresponds to setting esimates of the terms in the total variation of to zero. Bayesian 
trend hltering, however, never sets any terms identically to zero. Figure 2 compares Bayesian 
trend hltering to the original trend hltering formulation. Bayesian trend hltering doesn’t 
quite predict a piecewise linear function when in fact the true function is a piecewise linear 
function, with three knots at x equal to 20,45, and 80. What Bayesian trend hltering 
sacrihces in knot detection, it makes up for when htting smooth curves. The information 
gained by incorporating A into the Gibbs sampler, and the propagation of that information 
back to the estimates of / provides stable estimation, as is shown in Section 4. Thus, the 
primary advantage of Bayesian trend hltering is as a smoother and not as a knot-detection 
method. 
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Figure 2: A piecewise linear function (solid black) with knots at 20,45, and 80 fit by Bayesian 
trend filtering (red dot-dash) and the original trend hltering (purple dash). 95% credible 
intervals for Bayesian trend filtering (green dash) cover the true function and the frequentist 
trend hltering ht across the entire domain. 

3.2 Computational Considerations 

The very idea of the (generalized) lasso, to shrink some of the elements of the penalty term 
towards zero possibly setting some to exactly zero, can cause computational issues in the 
Bayesian setting. Consider the full conditional [l/cujl-], for either conditional prior dexp 
or gdp. The mean in the inverse-Gaussian distribution inverts exactly that which we are 
shrinking towards zero. Any sample from the posterior distribution such that an element of 
the penalty term is very close to zero, threatens numerical stability when drawing samples 
from the already rather numerically sensitive inverse-Gaussian distribution; see [Wheeler, 
2013] and the references there within. With Bayesian trend filtering, when an element of 
^£)(x,k+i) is too small, simulating a draw from [l/cu^l-] can return a value less than or equal 
to zero - obviously a problem for a distribution with non-negative support. To ameliorate 
such issues, we propose the admitedly inelegant solution of resampling the entire vector / if 
any element of jg jggg than 10“^°. We find that resampling happens less than five 
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percent of time. Further, when resampling does occur, it is extremely rare that more than 
one resample is ever needed. Despite such numerical issues, restricting the support of specihc 
posterior distributions does not appear to hinder Bayesian trend hltering’s performance; see 
Section 4.1. 


4 Numerical Results 

4.1 Simulations 

We compare Bayesian trend hltering (BTF), with both conditional priors dexp and gdp, 
against three different methods: trend hltering (TF) from Tibshirani [2014], Bayesian additive 
regression trees (BTree) from Chipman et al. [2010], Kapelner and Bleich [2013], and cubic 
smoothing splines GSM from Wood [2006]. BTree is included as it is becoming an increasingly 
popular Bayesian regression method. We include smoothing splines as they are arguably the 
most used method of smoothing, and also to highlight a different point than was made of the 
same comparison by Tibshirani [2014]. There, a strong argument was made for the efficiency 
of TF, implicitly dehned to be mean squared error (mse) per degree of freedom. In that 
world, TF clearly stands above as its asymptotic results are shown to beneht hnite sample 
sizes. Here, a more applied world is hypothesized, where estimation of A further dictates a 
methods performance. 

A functions used in the simulations are from various R [Core Team, 2014] packages. BTree 
was ht with bartMachine: :bartMachine using the default values of 50 trees and 9000 (after 
burn-in) iterations [Kapelner and Bleich, 2013]. GSM was £t using the cubic smoothing spline 
function mgcv: : gam, with all inputs used as knots (to make more fair the comparison between 
GSM and the trend filtering methods). TF was ht with genlasso: :trendf liter using a cubic 
piecewise polynomial, with both 5- and 10-fold cross validation [Tibshirani, 2014]. BTF, 
also using a cubic piecewise polynomial, was ht using btf :btf with 9000 (after burn-in) 
posterior samples [Roualdes, 2014]. Since the choice of hyperparameters a and p ehects the 
overall ht of both BTF-gdp and BTF-dexp, we explore Bayesian trend hltering’s responsiveness 
to changes in these hyperparameters. We ht BTF-dexp with a G {0.1, 0.5,1,1.5, 2} and 
p G {10“^, 10“^, 10“^, 0.1,1}, and BTF-gdp with each of a G {0.5, 0.1,1,1.5, 2} and p G 
{10“^, 10“^, 0.1,1}. We chose a number of “small” values of p to ensure a true thresholding 
rule, and thus encourage shrinkage [Armagan et ah, 2013]. 

For simulated data, we consider two univariate functions / : [0,1] M with 100 regularly 
spaced inputs. The first, a piecewise cubic function is borrowed from Tibshirani [2014]. The 
second is a difficult, spatially inhomogeneous function, colloquially known as dampened 
harmonic motion (dhm). 1000 replications of these two functions with three diherent levels 
of normal noise are evaluated upon three criteria. For each replication we calculate the 
mean and standard deviation of the mean squared errors (mse) and 95% conhdence intervals 
for both the underlying function evaluated at all inputs and the variance. 1000 bootstrap 
samples were used to estimate conhdence intervals of function estimates for TF, while GSM 
used the method predict.gam to obtain standard errors. All frequentist methods relied 
on the bootstrap to estimate We used posterior samples to create credible intervals for 
all the Bayesian methods. Thus, the 1000 replications are used to estimate the mean and 


standard deviation of the mses, and overall mean and standard errors of the two different 
coverage probabilities. 

The real world data consists of two data sets common to the smoothing literature. The 
hrst is a dataset of global mean surface temperature deviations for the years 1881 to 2005 
from Hodges [2013]. The second is the SILSO dataset, monthly sunspot counts for the years 
[1980,2014]. 

4.1.1 Piecewise Cubic 

Of the three levels of AA(0,(T^) noise, a G {0.75,1.0,1.25}, for the piecewise cubic function 
with a := 1.0, the right plot of Figure 3 shows the mses of 1000 replications of 100 obser¬ 
vations from the true function (solid black) displayed in the left plot. The BTF methods 
displayed in this plot used the hyperparameter values of a := 1 and p := 10“^. These plots 
are representative of all cases of the BTF methods for which p < 1, inclusive of all the tested 
values of a. BTF clearly produces mses with lower variance, although most methods save 
BTree (not shown) provide practically the same median mse. Still, BTF-gdp in Figure 3 pro¬ 
vided nearly the smallest set of mses: there are 45 and 25 hts from TF and CSM, respectively, 
where the mse is greater than the largest mse for BTF-gdp. And conversely, there are no hts 
that provide an mse smaller than the smallest mse from BTF-gdp. Table 1 provides mean 
and standard deviations of the mses for the different methods across all levels of noise chosen 
for the piecewise cubic function. 
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Figure 3: Both plots use noise level a := 1 and the BTF methods use a := 1 and p := 10“^. 
Left: The true piecewise cubic function (solid black) and the worst hts for the methods 
BTF-gdp (solid red) and CSM (dash green). Right: All 1000 replications of mses computed 
from hts of the labeled methods to the piecewise cubic function. 


The left plot of Figure 3 displays the hts with the greatest mse of BTF-gdp and CSM. 
The BTF methods appear to minimize the worst of the hts to the piecewise cubic function. 
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We highlight BTF-gdp because it almost unanimously outperforms BTF-dexp in all of our 
examples and we highlight CSM because it is the next best method, outside of the BTF 
methods, in terms of methods with extreme mse values. 



a = 

0.75 

a = 

1 

a = 

1.25 

method 

mean 

sd 

mean 

sd 

mean 

sd 

BTF-dexp 

8.9 

3.0 

16 

5.2 

25 

7.9 

BTF-gdp 

6.6 

2.4 

11 

4.2 

15 

6.4 

TF 

7.6 

5.9 

14 

11 

20 

17 

CSM 

7.2 

5.3 

12 

8.6 

18 

14 


Table 1: Mean and standard deviations of the 1000 mean squared errors for each of the three 
noise levels, rounded and multiplied by 100 for ease of comparison. The smallest value(s) 
within each column is(are) bold. The BTF methods used the hyperparameters a := 1 and 

p := 10-2. 

Consistent function estimation by BTF is seen when we compare coverage probabilities. 
Figure 4 plots the mean plus/minus two standard errors, at every level of noise, of function 
and variance coverage probabilities. Since we measured function coverage across the entire 
domain, the over-fitting of CSM and TF reduced their respective function coverage probabil¬ 
ities. Moreover, no method appropriately covered the variance at the nominal value. We 
notice, though do not provide plots of such here, that both BTF methods’ variance coverage 
declines as the hyperparameter p increases. 
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Figure 4: Each method’s function and variance coverage probablities are displayed for ev¬ 
ery level of noice considered for the piecewise cubic function. The BTF methods used the 
hyperparameter values a := 1 and p := IQ-^. 

While BTree (not shown) provides reasonably small mses for the piecewise cubic function, 
it had the highest median mse. This is largely due to the fact that it is not a smoothing 
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technique. BTree was removed from the plots throughout as it distracted from the main 
comparisons of interest. Still, BTree did quite well estimating the variance. In other contexts, 
where a smooth ht is not necessary, BTree has much to offer beyond what most of these 
smoothing techniques are able to handle. 

4.1.2 Spatially Inhomogeneous 

The second example uses the spatially inhomogeneous function 

f{x) = exp{—7.5a:} cos(1 Otto:), 

with Af{0,a^) noise and a G {0.025,0.05,0.075}. Figure 5 again plots the true function 
(solid black) on the left and all 1000 mses on the right, for the noise level a := 0.05. The 
BTF methods presented for dampened harmonic motion use the same hyperparameter values 
as for the piecewise cubic function, a := 1 and p := 10“^. Like with the piecewise cubic 
function, these hyperparameter values generally represent the all hyperparameter choices 
where p < 1. In Figure 5 we see that all trend hltering methods produce nearly identical 
median mses. The over-htting is still a problem for CSM and TF. Also, notice that the median 
mse for the cubic smoothing spline method CSM is a bit larger than the median mses for 
the trend hltering methods. Table 2 summarizes the mses for all levels of noise for the 
dampened harmonic motion function. The BTF methods provide the smallest mean mses, 
with the smallest standard deviation, for all levels of noise considered. 




Figure 5: Both plots use noise level a := 0.05 and the BTF methods use a := 1 and p := 10“^. 
Left: The true dampened harmonic motion function (solid black) and the worst hts for the 
methods BTF-gdp (solid red) and CSM (dash green). Right: All 1000 replications of mses 
computed from hts of the labeled methods to the dampened harmonic motion function. 

Generally with the dampened harmonic motion, as judged by mses, the BTF methods 
again perform well. In this case, BTF-dexp performs insignihcantly better than BTF-gdp. 
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This could be due to the heavy smoothing, required for dapmened harmonic motion, that 
the exponential tails of dexp encourage. Still, BTF-gdp gives smaller mean mses with smaller 
variation than the other methods. Greater than the largest mse of BTF-gdp there exists 42 
and 6 mses as produced by TF and CSM, respectively. Only TF provides any, specifically two, 
simulations with a smaller mse than that of BTF-gdp. 



a = 

0.025 

cr = 

0.05 

a = 

0.075 

method 

mean 

sd 

mean 

sd 

mean sd 

BTF-dexp 

1.3 

0.39 

4.7 

1.5 

9.9 

3.1 

BTF-gdp 

1.3 

0.39 

4.8 

1.6 

11 

3.9 

TF 

1.5 

0.92 

5.4 

3.9 

11 

7.2 

CSM 

1.9 

0.53 

6.2 

2.0 

12 

4.2 


Table 2: Mean and standard deviations of the 1000 mean squared errors for each of the 
three noise levels of the dampened harmonic motion function, rounded and multiplied by 
10^ for ease of comparison. The smallest value(s) within each column is(are) bold. The BTF 
methods used the hyperparameter values a := 1 and p := 10“^. 

Figure 6 shows function and variance coverage probability means plus/minus two stan¬ 
dard errors for the all levels of noise of the dampened harmonic motion function. Function 
estimation proved difficult for all the methods. This is likely do to the fact that we are mea¬ 
suring overall function coverage on a quite spatially inhomogeneous function. Still for many 
values of the hyperparameters, the BTF methods perform well in terms of function coverage 
probability, as is seen in the left plot of Figure 6. As for variance coverage, BTF-dexp does 
noticeably worse than BTF-gdp, which itself outperforms the other methods. 



Variance coverage 


0.95 - 


0.90- 

!5 

CO 

i.0.85 - 

CD 


O 

“ 0.80 - 


0.75- 


BTF-dexp BTptgdp TF CSM 


Figure 6: Each method’s function and variance coverage probablities are displayed for every 
level of noise considered for the dampened harmonic motion function. The BTF methods 
used the hyperparameter values a := 1 and p := 10“^. 
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4.2 Real Data Analysis 

4.2.1 Surface Temperatures 

The global surface temperature deviations data consist of yearly, [1881, 2005], measurements 
in units of 0.01 °C. Figure 7 displays various fits to these data. The global surface tem¬ 
perature data proves to be a problem for both 5/10-fold cross validation used by TF (dash 
black), as is seen in the right plot of Figure 7. Trend hltering essentially hts every data point. 
BTF-gdp (solid green), where a := 1 and p := 10“^, smooth these data toward the hypoth¬ 
esized underlying function as seen in the left plot of Figure 7. Because the trend hltering 
methods assume independent observations, possibly an incorrect error strucutre for these 
data, we also ht CSM with an autoregressive one, denoted AR(1), error structure (solid red). 
We hnd that the estimates of BTF-gdp and CSM with an AR(1) error structure, displayed in 
the left plot of Figure 7, are quite similar. In fact, BTF-gdp’s credible intervals (not shown) 
completely contain the CSM autoregressive one ht. 




Figure 7: Plotted are the 125 annual, global mean surface temperature deviations from years 
1881 until 2005, inclusive [Hodges, 2013]. Left: The hts of CSM (dash black), CSM with an 
AR(1) error structure (solid red), and BTF-gdp with a := 1 and p := 10“^ (solid green) are 
displayed. Right: Two estimates are shown, BTree (solid red) and TF using 10-fold cross 
validation (dash black). 

4.2.2 Sunspots 

The second applied data set we investigate is monthly sunspot counts over the years 1980 to 
2014, with 402 observations [Center, 1980-2014]. In the left plot of Figure 8, BTF-gdp (solid 
green) with a := 1 and p := 10“^ provides a nice smooth ht to these data. In the right plot 
of Figure 8, we see that TF (dash black) using 5-fold cross validation provides a noisy ht to 
these data. Both 5- and 10-fold cross validation provide nearly the same ht to these data. 
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BTree (solid red), in the left plot, provides a slightly less noisy estimate of these data than 
did TF. 




Figure 8: Plotted are the 402 monthly sunspot observations from 1980 to 2014 [Center, 
1980-2014], Left: The hts of CSM (dash black), CSM with an AR(1) error structure (solid 
red), and BTF-gdp with a := 1 and p := 10“^ (solid green) are displayed. Right: Two 
estimates are shown, TF (dash black) with 5-fold cross validation and BTree (solid red). 

Similar concerns about a possibly misspecihed error structure exist for these data when 
applying the trend hltering methods. We therefore ht CSM with an AR(1) error structure 
(solid red), as seen in the left plot of Figure 8. As with the temperature data, the BTF-gdp 
(solid green) and CSM-AR(l) hts are quite similar, again the estiamtes of CSM-AR(l) are 
completely contained by the credible intervals (not shown) of BTF-gdp. It is unclear which 
assumption is more appropriate for the underlying physical process of these data: correlated 
errors as in CSM with an AR(1) error structure, or a function with evaluations correlated 
across time and uncorrelated errors, as Bayesian trend hltering assumes. 


5 Discussion 

We developed a Bayesian, nonparametric smoother that hnds its origins in the lasso literture. 
Our method uses a simple scale mixture of normals to build a fully Bayesian hierarchical 
model. The hierarchical model of Bayesian trend hltering closely resembles the structure of 
Gaussian process regression, albeit with a unique covariance function that is best understood 
in terms of the underlying penalty; see Rasmussen [2006] for a thorough review of Gaussian 
processes. Bayesian trend hltering, in this way, lives in the intersection of ^l penalized 
regression, namely the (generalized) lasso, and the Gaussian process prior literature. From 
this vantage point, we found two distinct bodies of research that oher the framework for a 
proof of the convergence rate of Bayesian trend hltering. 
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Bayesian trend filtering is closely related to the work of Mammen and van de Geer [1997] 
on locally adaptive regression splines. To show that locally adaptive regression splines are 
convergent at the minimax rate, Mammen and van de Geer [1997] nse the metric entropy cal¬ 
culations of Van de Geer [1990], Mammen [1991] and also interpolating properties of splines 
developed by de Boor [2001]. Building on this work, Tibshirani [2014] proves a similar con¬ 
clusion for trend hltering. On the other hand, Bayesian trend hltering uses a hierarchical 
model common to Gaussian process regression. Some in the Gaussian process literature 
have a similar goal in mind: prove rates of convergence of posterior distributions that are 
based on Gaussian process priors [Ghosal et ah, 2007, Van der Vaart and Van Zanten, 2008, 
Van Der Vaart and Van Zanten, 2011]. This work on Gaussian process priors, which con¬ 
tains Bayesian trend hltering, notes that at the heart of the metric entropy proofs relating to 
nonparametric regression, there lies interpolating properties of the function space of interest 
[Ghosal et ah, 2007]. Though we were not able to prove such interpolation properties about 
the space of piecewise polynomials, this connection between penalized regression techniques 
and the Gaussian process regression methods is quite interesting. 

Bayesian trend hltering has good potential in application. Based on our simulations, we 
hnd that Bayesian trend hltering has good estimation and strong frequentist properties, as 
compared to the original trend hltering and a popular cubic smoothing spline method. These 
benehts come from both the ^l penalty, which acts similar to the total variation penalty, 
and the ability to estimate well the penalty parameter A. We also found that the generalized 
double Pareto conditional prior provides a substantial increase in Bayesian trend hltering’s 
accuracy and coverage probablities. Overall, Bayesian trend hltering decreases estimation 
error compared to the other methods we tested. 

Bayesian trend hltering’s benehts though come at some cost. Bayesian trend hltering 
relies on a matrix inversion within the full conditional [/!•] at every iteration. This matrix 
inverse is the most signihcant computational cost of Bayesian trend hltering, and it puts 
the method’s computational complexity to be 0{n^). Though, it should be reemphasized 
that depsite the computational burden, more information is gained from a Bayesian trend 
hltering ht. Some in the Gaussian process literature developed means to avoid such an 
inverse in special cases, for instance see Vehtari and Vanhatalo [2007], Liu et al. [2014], but 
these methods are not directly applicable to Bayesian trend hltering. 

A simplistic strategy to reduce the computation time for Bayesian trend hltering uses 
the Gibbs sampler’s fast convergence rate. From Proposition 3.1 and from our simulations, 
we know that the Gibbs sampler of Bayesian trend hltering converges very quickly. We 
could reduce computational complexity by sampling from the full conditional for / every 
mth iteration, while sampling all other full conditionals every iteration. This would reduce 
the computational burden of the full conditional [/!•], but at the same time produce larger 
ehective sample sizes for the other parameters of interest. We reht Bayesian trend hltering 
using this idea with m = 2. Table 3 displays the computation times of each method as ht 
to the real data sets discussed above. Bayesian tren hltering’s estimate of the underlying 
function at each input x* changed very little between the two hts, sampling every iteration 
(m = 1) verse sampling every other iteration (m = 2). The mean relative diherence of 
the posterior mean of function evaluations at each input Xi are 0.0028 and 0.0009, for the 
temperature and sunspot data, respectively. 

The proposed hierarchical model for Bayesian trend hltering is essentially the complete 
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dataset BTF(m=l/m=2) TF(k=5/10) CSM CSM-AR(l) BTREE 

temperature 7/4 10/20 0.2 0.5 9 

sunspots 98/51 46/94 7 13 19 


Table 3: Computation times in seconds for each method against each real data set. 


framework for Bayesian generalized lasso. With this many other models could be carried 
over into the Bayesian framework and with a variety of variations on the penalty presented 
here. For instance, the elastic net penalty might improve the accuracy of Bayesian trend 
hltering. Futher, Bayesian trend hltering itself could be modihed in a number of interesting 
ways. For instance, an additive model, E?/* = is desirable. However, because of 

the large computational complexity of Bayesian trend hltering it seems necessary to hrst rid 
this method of its matrix inversion. Possibly some approximate Bayesian sampling technique 
is suitable. This, together with the work on the minimax convergence rate proof via metric 
entropy methods, are left for future work. 

A Majorization-Minimization Algorithm 

We offer an approximation to the objective function (3). The approximation uses the 
majorization-minimization techniques developed by Hunter and Li [2005]. Convergence, up 
to numerical precision, is nearly guaranteed since equation (3) is convex. 

Definition A.l (Majorization). Let 6^"^^ be the iteration in a search for the minimum 
value of a function f{9). A function g{9\9^'^^) is said to majorize the real-valued function 
f{9) at the point 9^"^^ if 


V0, and 

Minimization of the function of interest is established by repeated minimization of the 
majorizor and some stopping criterion is satished, or when a maximum number of iterations 
is reached.. The stopping criteria considered here is that of stability of the estimates, i.e. 
the algorithm stops when \ — f^'^^Woo < t, where r := 10“® was chosen. 

For some e > 0 and specihed value Xcv, fhe following is a majorization of (3) 

9(/l/™) = \\y-f\\l + Lv -^log (i + 

Unfortunately, the e is not easily avoided as division by zero is otherwise encouraged. Nu¬ 
merical precision becomes more and more of an issue as smaller values of e, r are chosen. 
Despite these issues with this approximation strategy, in our experience the mean absolute 
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difference between this approximate solution and genlasso: :trendf liter’s exact calcula¬ 
tion was generally around 10“3. 
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